
Kaggle Challenges with House Prices: Advanced Regression Techniques


The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It's an excellent alternative for data scientists looking for a modernized and expanded version of the often-cited Boston Housing dataset.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

To download the dataset and for more information, visit the House Prices dataset on Kaggle.

In [5]:
df
Out[5]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [8]:
test
Out[8]:
id mssubclass mszoning lotfrontage lotarea street alley lotshape landcontour utilities ... screenporch poolarea poolqc fence miscfeature miscval mosold yrsold saletype salecondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

Data Cleaning:

1- Lowercase the column names
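This step can be sketched as follows; since the notebook's code cells are not preserved in this export, the DataFrame below is a tiny stand-in for the real train data:

```python
import pandas as pd

# Tiny stand-in for the real train DataFrame loaded from train.csv
df = pd.DataFrame({"Id": [1, 2], "MSSubClass": [60, 20], "SalePrice": [208500, 181500]})

# Lowercase every column name; the same is done for the test DataFrame
df.columns = df.columns.str.lower()
```

After this, both frames can be addressed with consistent lowercase names such as `saleprice`.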

In [9]:
df
Out[9]:
id mssubclass mszoning lotfrontage lotarea street alley lotshape landcontour utilities ... poolarea poolqc fence miscfeature miscval mosold yrsold saletype salecondition saleprice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [10]:
test
Out[10]:
id mssubclass mszoning lotfrontage lotarea street alley lotshape landcontour utilities ... screenporch poolarea poolqc fence miscfeature miscval mosold yrsold saletype salecondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

2- Find the nulls in the datasets:

  • nulls of trainset
In [309]:
 
Out[309]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cb76211848>
  • nulls of testset
In [26]:
 
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x2472afc54c8>
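One common way to produce null plots like the `AxesSubplot` outputs above is a bar chart of per-column null counts; a sketch with stand-in data, since the actual cells are empty in this export:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Stand-in for the train DataFrame
df = pd.DataFrame({
    "lotfrontage": [65.0, np.nan, 68.0],
    "alley": [np.nan, np.nan, "Grvl"],
    "saleprice": [208500, 181500, 223500],
})

null_counts = df.isnull().sum()    # nulls per column
ax = null_counts.plot(kind="bar")  # returns an AxesSubplot, as in the Out[] above
ax.set_ylabel("null count")
```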

3- Use fillna() to fill the nulls in the data:

  • Fill the nulls found in the previous step
  • Fill the coded nulls using the dataset's description file
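A minimal sketch of both fill strategies on stand-in data; the "No Basement" code follows the dataset's description file, where NA marks an absent feature rather than missing data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "lotfrontage": [65.0, np.nan, 68.0],  # numeric nulls
    "bsmtcond": ["TA", np.nan, "Gd"],     # NaN here codes "no basement"
})

# Numeric nulls: fill with a central value such as the median
df["lotfrontage"] = df["lotfrontage"].fillna(df["lotfrontage"].median())

# Coded nulls: per the description file, NaN means the feature is absent
df["bsmtcond"] = df["bsmtcond"].fillna("No Basement")
```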

- Check nulls for trainset

In [25]:
df.isnull().sum()
Out[25]:
id               0
mssubclass       0
mszoning         0
lotfrontage      0
lotarea          0
                ..
mosold           0
yrsold           0
saletype         0
salecondition    0
saleprice        0
Length: 75, dtype: int64

- Check nulls for testset

In [37]:
test.isnull().sum()
Out[37]:
id               0
mssubclass       0
mszoning         0
lotfrontage      0
lotarea          0
                ..
miscval          0
mosold           0
yrsold           0
saletype         0
salecondition    0
Length: 74, dtype: int64

Visualize Numerical Columns

Saleprice with :

  • MSSubClass: Identifies the type of dwelling involved in the sale.
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • OverallQual: Rates the overall material and finish of the house
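These per-feature plots against SalePrice can be sketched like this (stand-in data; the real notebook would loop over the four columns listed above):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "lotarea": [8450, 9600, 11250, 9550],
    "overallqual": [7, 6, 7, 7],
    "saleprice": [208500, 181500, 223500, 140000],
})

# One scatter panel per numeric feature against saleprice
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, col in zip(axes, ["lotarea", "overallqual"]):
    ax.scatter(df[col], df["saleprice"])
    ax.set_xlabel(col)
    ax.set_ylabel("saleprice")
```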
In [38]:
 

Saleprice with :

  • OverallCond: Rates the overall condition of the house
  • YearBuilt: Original construction date
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet
In [39]:
 

Saleprice with :

  • GrLivArea : Above grade (ground) living area square feet
  • 1stFlrSF : First Floor square feet
  • FullBath : Full bathrooms above grade
  • TotRmsAbvGrd : Total rooms above grade (does not include bathrooms)
In [40]:
 

Saleprice with :

  • BsmtCond: Evaluates the general condition of the basement
In [337]:
 
Out[337]:
saleprice
bsmtcond
Fa 5481429
Gd 13883994
No Basement 3909157
Po 128000
TA 240574866
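The table above looks like a groupby-sum of SalePrice per basement-condition category; a sketch with stand-in data:

```python
import pandas as pd

df = pd.DataFrame({
    "bsmtcond": ["TA", "TA", "Gd", "No Basement"],
    "saleprice": [208500, 181500, 223500, 140000],
})

# Total sale price per basement-condition category
totals = df.groupby("bsmtcond")[["saleprice"]].sum()
```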
In [339]:
 

Saleprice with :

  • BsmtFinType1: Rating of basement finished area
In [340]:
 
Out[340]:
col_0 sum
bsmtfintype1
ALQ 0.134656
BLQ 0.083814
GLQ 0.372770
LwQ 0.042568
No Basement 0.014809
Rec 0.074007
Unf 0.277375
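The normalized column above sums to 1, which suggests each category's share of the total SalePrice; a sketch with stand-in data:

```python
import pandas as pd

df = pd.DataFrame({
    "bsmtfintype1": ["GLQ", "GLQ", "Unf", "ALQ"],
    "saleprice": [200000, 200000, 100000, 100000],
})

# Each category's share of the summed sale price; the shares sum to 1
share = df.groupby("bsmtfintype1")["saleprice"].sum() / df["saleprice"].sum()
```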
In [341]:
 

Saleprice with :

  • YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
In [342]:
 

Saleprice with:

  • TotalBsmtSF: Total square feet of basement area
In [346]:
 
  • TotalBsmtSF: Total square feet of basement area
  • MasVnrArea: Masonry veneer area in square feet
In [347]:
 
In [54]:
 
In [56]:

In [57]:
 
In [60]:
 
  • Saleprice
  • SaleType: Type of sale
In [61]:
 
In [62]:
 
  • Saleprice
  • PavedDrive: Paved driveway
In [63]:
 
In [64]:
 
In [66]:
 
In [67]:
 

Result

column name   accepted as predictor   reason
MSSubClass    F   good info but poor correlation with price
LotFrontage   T   good info, good correlation with price
LotArea       T   good info, good correlation with price
OverallQual   F   a rating scale covering many aspects - not solid info - can't rely on it
OverallCond   F   a rating scale covering many aspects - not solid info - can't rely on it
YearBuilt     T   good info, good correlation with price
GarageArea    T   good info, good correlation with price
GarageCars    T   good info, good correlation with price
GrLivArea     T   good info, good correlation with price
1stFlrSF      T   good info, good correlation with price
FullBath      T   good info, good correlation with price
TotRmsAbvGrd  T   good info, good correlation with price
GarageYrBlt   T   good info, good correlation with price
Fireplaces    T   good info, good correlation with price
2ndFlrSF      T   good info, good correlation with price
HalfBath      T   good info, good correlation with price

Categorical Variables

In [148]:
 
Out[148]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cb6c33d908>
In [149]:
 
Out[149]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cb6c150288>
In [150]:
 
Out[150]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cb6c9ec988>
In [151]:
 
Out[151]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cb715d1fc8>
In [152]:

Out[152]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cb726570c8>
In [153]:
 
In [154]:
 
In [155]:
 
In [156]:
 
In [157]:
 
In [158]:
 
In [159]:
 
In [161]:
 
Out[161]:
<matplotlib.axes._subplots.AxesSubplot at 0x1cb73d6a2c8>
In [165]:
 
In [171]:
 
In [172]:
 

We will deal with our categorical variables using the dummy-variables technique.
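A sketch of dummy-variable encoding with pandas, on stand-in columns:

```python
import pandas as pd

df = pd.DataFrame({"mszoning": ["RL", "RM", "RL"], "lotarea": [8450, 9600, 11250]})

# One-hot encode the categorical column; drop_first avoids a redundant level
dummies = pd.get_dummies(df, columns=["mszoning"], drop_first=True)
```

Numeric columns such as `lotarea` pass through unchanged.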

Modeling

  • Import the needed libraries.
  • Assign the target and features for modeling.
  • Scale the data using StandardScaler.
  • Use cross-validation to split the data.
  • Get the score using multiple models available in sklearn.
  • Test the model on the testset.
  • Get the predictions.
  • Upload the predictions to Kaggle.
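The steps above can be sketched end to end on synthetic data; the real notebook would use the dummy-encoded house features as X and saleprice as y:

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the feature matrix and target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)  # fit the scaler on train only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

model = LinearRegression().fit(X_train_s, y_train)
scores = cross_val_score(model, X_train_s, y_train, cv=5)  # cross-validated R^2
```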

Using Lasso

  • Score on the split dataset (X_train, y_train)
  • Score on the split dataset (X_test, y_test)
In [484]:
 
0.9083543412078751
C:\Users\hsfd\Anaconda3\lib\site-packages\sklearn\linear_model\coordinate_descent.py:475: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 2906897207.8632812, tolerance: 637884887.3599926
  positive)
Out[484]:
0.8542284299729207
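A sketch of the Lasso fit on synthetic data; note the ConvergenceWarning above — raising max_iter, as the warning itself suggests, is the usual fix:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic sparse-signal data as a stand-in for the house features
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, 2.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Larger max_iter avoids the convergence warning seen above
lasso = Lasso(alpha=0.1, max_iter=10000).fit(X_train, y_train)
train_r2 = lasso.score(X_train, y_train)
test_r2 = lasso.score(X_test, y_test)
preds = lasso.predict(X_test)
```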
  • Prediction on the testset:
In [487]:
 
Out[487]:
array([29468713.43309826, 34731391.99716575, 46965908.97645118, ...,
       43976848.15853342, 28686772.79561864, 52078427.3670783 ])

Using RandomForestRegressor

  • Score on the split dataset (X_train, y_train)
In [491]:
 
Out[491]:
0.9725236231851994
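A sketch of the random-forest fit on synthetic data; tree ensembles typically score very high on their own training split, as the 0.97 above shows:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

rf = RandomForestRegressor(n_estimators=100, random_state=2).fit(X_train, y_train)
train_r2 = rf.score(X_train, y_train)  # R^2 on the training split
preds = rf.predict(X_test)
```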
  • Prediction on the testset:
In [495]:
 
Out[495]:
array([530959.68333333, 464539.42083333, 525601.95      , ...,
       479881.96333333, 443896.21363636, 518828.45083333])

Lasso with GridSearchCV

  • Score on the split dataset (X_train, y_train)
In [509]:
 
Out[509]:
0.754589709086485
  • Score on the split dataset (X_test, y_test)
In [510]:
 
Out[510]:
0.7710195616389807
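Tuning Lasso's alpha with a grid search can be sketched on synthetic data as:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, 2.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# Search over the regularisation strength alpha with 5-fold CV
grid = GridSearchCV(Lasso(max_iter=10000), {"alpha": [0.01, 0.1, 1.0]}, cv=5)
grid.fit(X_train, y_train)

train_r2 = grid.score(X_train, y_train)  # scored with the best estimator found
test_r2 = grid.score(X_test, y_test)
```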
  • Prediction on the testset:
In [514]:
 
Out[514]:
array([48850258.77643203, 44641458.89057335, 51344403.54278465, ...,
       66616018.01954941, 38179949.18808202, 44425702.97012807])